Abstract

We present a document compression system that uses a hierarchical noisy-channel model of text production. Our compression system first automatically derives the syntactic structure of each sentence and the overall discourse structure of the text given as input. The system then uses a statistical hierarchical model of text production in order to drop non-important syntactic and discourse constituents so as to generate coherent, grammatical document compressions of arbitrary length. The system outperforms both a baseline and a sentence-based compression system that operates by simplifying sequentially all sentences in a text. Our results support the claim that discourse knowledge plays an important role in document summarization.

1 Introduction

Single document summarization systems proposed to date fall within one of the following three classes:

Extractive summarizers simply select and present to the user the most important sentences in a text; see (Mani and Maybury, 1999; Marcu, 2000; Mani, 2001) for comprehensive overviews of the methods and algorithms used to accomplish this.

Headline generators are noisy-channel probabilistic systems that are trained on large corpora of ⟨Headline, Text⟩ pairs (Banko et al., 2000; Berger and Mittal, 2000). These systems produce short sequences of words that are indicative of the content of the text given as input.

Sentence simplification systems (Chandrasekar et al., 1996; Mahesh, 1997; Carroll et al., 1998; Grefenstette, 1998; Jing, 2000; Knight and Marcu, 2000) are capable of compressing long sentences by deleting unimportant words and phrases.

Extraction-based summarizers often produce outputs that contain non-important sentence fragments. For example, the hypothetical extractive summary of Text (1), which is shown in Table 1, can be compacted further by deleting the clause “which is already almost enough to win”. Headline-based summaries, such as that shown in Table 1, are usually indicative of a text’s content but not informative, grammatical, or coherent. By repeatedly applying a sentence-simplification algorithm one sentence at a time, one can compress a text; yet, the outputs generated in this way are likely to be incoherent and to contain unimportant information. When summarizing text, some sentences should be dropped altogether.

Ideally, we would like to build systems that have the strengths of all three of these classes of approaches. The “Document Compression” entry in Table 1 shows a grammatical, coherent summary of Text (1), which was generated by a hypothetical document compression system that preserves the most important information in a text while deleting sentences, phrases, and words that are subsidiary to the main message of the text. Obviously, generating coherent, grammatical summaries such as that produced by the hypothetical document compression system in Table 1 is not trivial because of many conflicting
Type of summarizer | Hypothetical output | Output contains only important info / Output is coherent / Output is grammatical
Extractive summarizer | John Doe has already secured the vote of most democrats in his constituency, which is already almost enough to win. But without the support of the governor, he is still on shaky ground. | ✓
Headline generator | mayor vote constituency governor | ✓
Sentence simplifier | The mayor is now looking for re-election. John Doe has already secured the vote of most democrats in his constituency. He is still on shaky ground. | ✓
Document compressor | John Doe has secured the vote of most democrats. But he is still on shaky ground. | ✓ ✓

Table 1: Hypothetical outputs generated by various types of summarizers.


goals.¹ The deletion of certain sentences may result in incoherence and information loss. The deletion of certain words and phrases may also lead to ungrammaticality and information loss.

(1) The mayor is now looking for re-election. John Doe has already secured the vote of most democrats in his constituency, which is already almost enough to win. But without the support of the governor, he is still on shaky ground.

In this paper, we present a document compression system that uses hierarchical models of discourse and syntax in order to simultaneously manage all these conflicting goals. Our compression system first automatically derives the syntactic structure of each sentence and the overall discourse structure of the text given as input. The system then uses a statistical hierarchical model of text production in order to drop non-important syntactic and discourse units so as to generate coherent, grammatical document compressions of arbitrary length. The system outperforms both a baseline and a sentence-based compression system that operates by simplifying sequentially all sentences in a text.

2 Document Compression

The document compression task is conceptually simple. Given a document $D = \langle w_1 w_2 \ldots w_n \rangle$, our goal is to produce a new document $D'$ by “dropping” words $w_i$ from $D$. In order to achieve this goal, we extend the noisy-channel model proposed by Knight & Marcu (2000). Their system compressed sentences by dropping syntactic constituents, but could be applied to entire documents only on a sentence-by-sentence basis. As discussed in Section 1, this is not adequate because the resulting summary may contain many compressed sentences that are irrelevant. In order to extend Knight & Marcu’s approach beyond the sentence level, we need to “glue” sentences together in a tree structure similar to that used at the sentence level. Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) provides us with this “glue.”

¹ A number of other systems use the outputs of extractive summarizers and repair them to improve coherence (DUC, 2001; DUC, 2002). Unfortunately, none of these seems flexible enough to produce in one shot good summaries that are simultaneously coherent and grammatical.

The tree in Figure 1 depicts the RST structure of Text (1). In RST, discourse structures are non-binary trees whose leaves correspond to elementary discourse units (EDUs), and whose internal nodes correspond to contiguous text spans. Each internal node in an RST tree is characterized by a rhetorical relation. For example, the first sentence in Text (1) provides background information for interpreting the information in sentences 2 and 3, which are in a Contrast relation (see Figure 1). Each relation holds between two adjacent non-overlapping text spans called nucleus and satellite. (There are a few exceptions to this rule: some relations, such as List and Contrast, are multinuclear.) The distinction between nuclei and satellites comes from the empirical observation that the nucleus expresses what is more essential to the writer’s purpose than the satellite.
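The nucleus/satellite distinction can be made concrete with a small data structure. The following is a minimal sketch, not the representation used by the system: the `Node` class, the helper names, and the particular tree given for Text (1) are assumptions pieced together from the relations named in the text. It shows that following nuclei alone already yields a short summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a discourse tree: a leaf holds an EDU, an internal node holds children."""
    role: str                      # "Nuc" or "Sat" ("Root" for the top node)
    relation: str                  # e.g. "Span", "Contrast", "Background"
    text: Optional[str] = None     # set only on leaves (EDUs)
    children: List["Node"] = field(default_factory=list)


def edus(node: Node) -> List[str]:
    """Return the EDUs under a node, left to right."""
    if node.text is not None:
        return [node.text]
    return [t for child in node.children for t in edus(child)]


def nuclei_only(node: Node) -> List[str]:
    """Return the EDUs reached by following nuclei only (satellites are skipped)."""
    if node.text is not None:
        return [node.text]
    return [t for child in node.children if child.role == "Nuc"
            for t in nuclei_only(child)]


# Assumed discourse analysis of Text (1), inferred from the relations named in the text.
text1 = Node("Root", "Root", children=[
    Node("Sat", "Background", "The mayor is now looking for re-election."),
    Node("Nuc", "Span", children=[
        Node("Nuc", "Contrast", children=[
            Node("Nuc", "Span", "John Doe has already secured the vote of most democrats in his constituency,"),
            Node("Sat", "Evaluation", "which is already almost enough to win."),
        ]),
        Node("Nuc", "Contrast", children=[
            Node("Sat", "Condition", "But without the support of the governor,"),
            Node("Nuc", "Span", "he is still on shaky ground."),
        ]),
    ]),
])

print(len(edus(text1)))      # number of EDUs in the tree
print(nuclei_only(text1))    # the EDUs that remain when satellites are dropped
```

Dropping every satellite is a blunt heuristic used here only to illustrate the tree; the model described in this paper instead decides probabilistically which constituents to drop.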

Our system is able to analyze both the discourse structure of a document and the syntactic structure of each of its sentences or EDUs. It then compresses the document by dropping either syntactic or discourse constituents.

3 A Noisy-Channel Model

For a given document $D$, we want to find the summary text $S$ that maximizes $P(S \mid D)$. Using Bayes rule, we flip this so we end up maximizing $P(D \mid S)P(S)$. Thus, we are left with modelling two probability distributions: $P(D \mid S)$, the probability of a document $D$ given a summary $S$, and $P(S)$, the probability of a summary. We assume that we are given the discourse structure of each document and the syntactic structures of each of its EDUs.
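Spelled out, the step from the posterior to the product relies only on the fact that $P(D)$ is constant for a fixed input document. The symbol $\hat{S}$ for the selected summary is notation introduced here for illustration:

```latex
\begin{align*}
\hat{S} &= \arg\max_{S} P(S \mid D) \\
        &= \arg\max_{S} \frac{P(D \mid S)\,P(S)}{P(D)} \\
        &= \arg\max_{S} P(D \mid S)\,P(S)
\end{align*}
```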

The intuitive way of thinking about this application of Bayes rule, referred to as the noisy-channel model, is that we start with a summary $S$ and add “noise” to it, yielding a longer document $D$. The noise added in our model consists of words, phrases and discourse units.

For instance, given the document “John Doe has secured the vote of most democrats.” we could add words to it (namely the word “already”) to generate “John Doe has already secured the vote of most democrats.” We could also choose to add an entire syntactic constituent, for instance a prepositional phrase, to generate “John Doe has secured the vote of most democrats in his constituency.” These are both examples of sentence expansion as used previously by Knight & Marcu (2000).

Our system, however, also has the ability to expand on a core message by adding discourse constituents. For instance, it could decide to add another discourse constituent to the original summary “John Doe has secured the vote of most democrats” by Contrasting the information in the summary with the uncertainty regarding the support of the governor, thus yielding the text: “John Doe has secured the vote of most democrats. But without the support of the governor, he is still on shaky ground.”

As in any noisy-channel application, there are three parts that we have to account for if we are to build a complete document compression system: the channel model, the source model and the decoder. We describe each of these below.

The source model assigns to a string the probability $P(S)$, the probability that the summary $S$ is good English. Ideally, the source model should disfavor ungrammatical sentences and documents containing incoherently juxtaposed sentences.

The channel model assigns to any document/summary pair a probability $P(D \mid S)$. This models the extent to which $D$ is a good expansion of $S$. For instance, if $S$ is “The mayor is now looking for re-election.”, $D_1$ is “The mayor is now looking for re-election. He has to secure the vote of the democrats.” and $D_2$ is “The mayor is now looking for re-election. Sharks have sharp teeth.”, we expect $P(D_1 \mid S)$ to be higher than $P(D_2 \mid S)$ because $D_1$ expands on $S$ by elaboration, while $D_2$ shifts to a different topic, yielding an incoherent text.

The decoder searches through all possible summaries of a document $D$ for the summary $S$ that maximizes the posterior probability $P(D \mid S)P(S)$.

Each of these parts is described below.

3.1 Source model

The job of the source model is to assign a score $P(S)$ to a compression independent of the original document. That is, the source model should measure how good English a summary is (independent of whether it is a good compression or not). Currently, we use a bigram measure of quality (trigram scores were also tested but failed to make a difference), combined with non-lexicalized context-free syntactic probabilities and context-free discourse probabilities, giving $P(S) = P_{\text{bigram}}(S) \cdot P_{\text{PCFG}}(S) \cdot P_{\text{DPCFG}}(S)$. It would be better to use a lexicalized context-free grammar, but that was not possible given the decoder used.
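The product of the three source-model terms is usually computed as a sum of log-probabilities. The sketch below illustrates this under several assumptions: the add-one smoothing, the boundary markers `BOS`/`EOS`, the toy training corpus, and all function names are illustrative choices, and the PCFG and discourse-PCFG terms are passed in as plain numbers rather than computed from parse trees.

```python
import math
from collections import Counter
from typing import List, Tuple


def train_bigram(corpus: List[List[str]]) -> Tuple[Counter, Counter, int]:
    """Collect context counts and bigram counts, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        tokens = ["BOS"] + sent + ["EOS"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)


def bigram_logprob(sent: List[str], contexts: Counter, bigrams: Counter,
                   vocab_size: int) -> float:
    """Add-one smoothed bigram log-probability (the smoothing is an assumption)."""
    tokens = ["BOS"] + sent + ["EOS"]
    return sum(
        math.log((bigrams[(p, c)] + 1) / (contexts[p] + vocab_size))
        for p, c in zip(tokens, tokens[1:])
    )


def source_logprob(lm: float, pcfg: float, dpcfg: float) -> float:
    """log P(S) = log P_bigram(S) + log P_PCFG(S) + log P_DPCFG(S)."""
    return lm + pcfg + dpcfg


# Toy training data, for illustration only.
corpus = [
    "the mayor is looking for re-election".split(),
    "john doe is still on shaky ground".split(),
]
contexts, bigrams, vocab_size = train_bigram(corpus)

good_lp = bigram_logprob("the mayor is looking".split(), contexts, bigrams, vocab_size)
bad_lp = bigram_logprob("looking is mayor the".split(), contexts, bigrams, vocab_size)
print(good_lp > bad_lp)  # the word order seen in training scores higher
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once the syntactic and discourse terms are added for a whole document.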

3.2 Channel model

The channel model is allowed to add syntactic constituents (through a stochastic operation called constituent-expand) or discourse units (through another stochastic operation called EDU-expand). Both of these operations are performed on a combined discourse/syntax tree called the DS-tree. The DS-tree for Text (1) is shown in Figure 1 for reference.

Suppose we start with the summary $S =$ “The mayor is looking for re-election.” A constituent-expand operation could insert a syntactic constituent, such as “this year”, anywhere in the syntactic tree of $S$. A constituent-expand operation could also add single words: for instance the word “now” could be added between “is” and “looking,” yielding $D =$ “The mayor is now looking for re-election.” The probability of inserting this word is based on the syntactic structure of the node into which it’s inserted.

Knight and Marcu (2000) describe in detail a noisy-channel model that explains how short sentences can be expanded into longer ones by inserting and expanding syntactic constituents (and words). Since our constituent-expand stochastic operation simply reimplements Knight and Marcu’s model, we do not focus on it here. We refer the reader to (Knight and Marcu, 2000) for the details.

In addition to adding syntactic constituents, our system is also able to add discourse units. Consider the summary $S =$ “John Doe has already secured the vote of most democrats in his constituency.” Through a sequence of discourse expansions, we can expand upon this summary to reach the original text. A complete discourse expansion process that would occur starting from this initial summary to generate the original document is shown in Figure 2.

In this figure, we can follow the sequence of steps required to generate our original text, beginning with our summary $S$. First, through an operation D-Project (“D” for “D”iscourse), we increase the depth of the tree, adding an intermediate Nuc=Span node. This projection adds a factor of $P(\text{Nuc=Span} \rightarrow \text{Nuc=Span} \mid \text{Nuc=Span})$ to the probability of this sequence of operations (as is shown under the arrow).

We are now able to perform the second operation, D-Expand, with which we expand on the core message contained in $S$ by adding a satellite which evaluates the information presented in $S$. This expansion adds the probability of performing the expansion (called a discourse expansion probability, $P_{DE}$). An example discourse expansion probability, written $P(\text{Nuc=Span} \rightarrow \text{Nuc=Span Sat=Eval} \mid \text{Nuc=Span} \rightarrow \text{Nuc=Span})$, reflects the probability of adding an evaluation satellite onto a nuclear span.

The rest of Figure 2 shows some of the remaining steps to produce the original document, each step labeled with the appropriate probability factors. Then, the probability of the entire expansion is the product of all those listed probabilities combined with the appropriate probabilities from the syntax side of things. In order to produce the final score $P(D \mid S)$ for a document/summary pair, we multiply together each of the expansion probabilities in the path leading from $S$ to $D$.
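The multiplication of expansion factors along a derivation can be sketched as a sum of logarithms. This is an illustration only: the `Step` tuple layout and the function name are assumptions, and the two probability values are hypothetical placeholders rather than estimates from any corpus.

```python
import math
from typing import List, Tuple

# One step of a derivation: (operation name, rule description, probability).
Step = Tuple[str, str, float]


def channel_logprob(steps: List[Step]) -> float:
    """Sum the log-probabilities of every expansion step on the path from S to D."""
    total = 0.0
    for op, rule, prob in steps:
        if not 0.0 < prob <= 1.0:
            raise ValueError(f"{op} {rule}: probability must be in (0, 1], got {prob}")
        total += math.log(prob)
    return total


# Hypothetical probabilities; the rules mirror the D-Project and D-Expand steps.
derivation = [
    ("D-Project", "Nuc=Span -> Nuc=Span | Nuc=Span", 0.5),
    ("D-Expand", "Nuc=Span -> Nuc=Span Sat=Eval | Nuc=Span -> Nuc=Span", 0.2),
]

log_p = channel_logprob(derivation)
print(math.exp(log_p))  # the product of the listed factors
```

A full score would add the log-probabilities of the syntactic constituent-expand steps to the same running total.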

For estimating the parameters for the discourse models, we used an RST corpus of 385 Wall Street Journal articles from the Penn Treebank, which we obtained from LDC. The documents in the corpus range in size from 31 to 2124 words, with an average of 458 words per document. Each document is paired with a discourse structure that was manually built in the style of RST. (See (Carlson et al., 2001) for details concerning the corpus and the annotation process.) From this corpus, we were able to estimate parameters for a discourse PCFG using standard maximum likelihood methods.

Figure 2: A sequence of discourse expansions for Text (1) (with probability factors).

Furthermore, 150 documents from the same corpus are paired with extractive summaries on the EDU level. Human annotators were asked which EDUs were most important; suppose in the example DS-tree (Figure 1) the annotators marked the second and fifth EDUs (the starred ones). These stars are propagated up, so that any discourse unit that has a descendant considered important is also considered important. From these annotations, we could deduce that, to compress a Nuc=Contrast that has two children, Nuc=Span and Sat=evaluation, we can drop the evaluation satellite. Similarly, we can compress a Nuc=Contrast that has two children, Sat=condition and Nuc=Span, by dropping the first discourse constituent. Finally, we can compress the Root deriving into Sat=Background Nuc=Span by dropping the Sat=Background constituent. We keep counts of each of these examples and, once collected, we normalize them to get the discourse expansion probabilities.
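The star propagation and count normalization described above can be sketched as follows. The tuple encoding of the tree, the function names, and the choice to condition each full expansion on the parent label together with its compressed form are assumptions made for illustration; the single example tree is the one discussed for Text (1), so the resulting numbers are not real estimates.

```python
from collections import Counter, defaultdict
from typing import Dict

# A leaf is (label, starred: bool); an internal node is (label, [children]).


def is_leaf(node) -> bool:
    return isinstance(node[1], bool)


def important(node) -> bool:
    """A unit is important if it is starred or has a starred descendant."""
    if is_leaf(node):
        return node[1]
    return any(important(child) for child in node[1])


def collect(node, counts: Counter) -> None:
    """Count (parent, kept children, full children) for every important internal node."""
    if is_leaf(node) or not important(node):
        return
    full = tuple(child[0] for child in node[1])
    kept = tuple(child[0] for child in node[1] if important(child))
    counts[(node[0], kept, full)] += 1
    for child in node[1]:
        collect(child, counts)


def normalize(counts: Counter) -> Dict[tuple, float]:
    """Relative frequency of each full expansion given the parent and its compressed form."""
    totals = defaultdict(int)
    for (parent, kept, _full), n in counts.items():
        totals[(parent, kept)] += n
    return {key: n / totals[(key[0], key[1])] for key, n in counts.items()}


# Text (1) with the second and fifth EDUs starred.
tree = ("Root", [
    ("Sat=Background", False),
    ("Nuc=Span", [
        ("Nuc=Contrast", [("Nuc=Span", True), ("Sat=evaluation", False)]),
        ("Nuc=Contrast", [("Sat=condition", False), ("Nuc=Span", True)]),
    ]),
])

counts = Counter()
collect(tree, counts)
probs = normalize(counts)
for key, p in probs.items():
    print(key, p)
```

With a real corpus, `collect` would be called once per annotated document before normalizing, so that the counts accumulate across all trees.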

3.3 Decoder

The goal of the decoder is to combine $P(S)$ with $P(D \mid S)$ to get $P(S \mid D)$. There are a vast number of potential compressions of a large DS-tree, but we can efficiently pack them into a shared-forest structure, as described in detail by Knight & Marcu (2000). Each entry in the shared-forest structure has three associated probabilities, one from the source syntax PCFG, one from the source discourse PCFG and one from the expansion-template probabilities described in Section 3.2. Once we have generated a forest representing all possible compressions of the original document, we want to extract the best (or the n-best) trees, taking into account both the expansion probabilities of the channel model and the bigram and syntax and discourse PCFG probabilities of the source model. Thankfully, such a generic extractor has already been built (Langkilde, 2000). For our purposes, the extractor selects the trees with the best combination of LM and expansion scores after performing an exhaustive search over all possible summaries. It returns a list of such trees, one for each possible length.
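To make the decoder's output concrete, the sketch below enumerates every word-level compression of a short string and keeps the best-scoring candidate for each length. It is a brute-force stand-in, not the packed shared-forest extractor used by the system, and the per-word scores are hypothetical placeholders for the combined channel and source log-probabilities.

```python
from itertools import combinations
from typing import Callable, Dict, Iterator, List, Tuple


def compressions(words: List[str]) -> Iterator[List[str]]:
    """Yield every order-preserving subsequence of `words` (2**n of them)."""
    for k in range(len(words) + 1):
        for idx in combinations(range(len(words)), k):
            yield [words[i] for i in idx]


def best_per_length(words: List[str],
                    score: Callable[[List[str]], float]) -> Dict[int, Tuple[float, List[str]]]:
    """Exhaustively score all compressions and keep the best one for each length."""
    best: Dict[int, Tuple[float, List[str]]] = {}
    for cand in compressions(words):
        s = score(cand)
        if len(cand) not in best or s > best[len(cand)][0]:
            best[len(cand)] = (s, cand)
    return best


# Hypothetical per-word log-scores standing in for log P(D|S) + log P(S).
weights = {"the": -1.0, "mayor": -0.2, "is": -0.5, "now": -2.0, "looking": -0.3}
words = ["the", "mayor", "is", "now", "looking"]
table = best_per_length(words, lambda c: sum(weights[w] for w in c))

print(sum(1 for _ in compressions(words)))  # number of candidates, 2**5
print(table[2][1])                          # best two-word compression
```

Because every kept word lowers the total in this toy scoring, the unnormalized score always favors shorter candidates; enumerating all subsets is also only feasible for very short inputs, which is why a packed forest is needed in practice.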

4 System

The system developed works in a pipelined fashion as shown in Figure 3. The first step along the pipeline is to generate the discourse structure. To do this, we use the decision-based discourse parser described by Marcu (2000)². Once we have the discourse structure, we send each EDU off to a syntactic parser (Collins, 1997). The syntax trees of the EDUs are then merged with the discourse tree in the forest generator to create a DS-tree similar to that shown in Figure 1. From this DS-tree we generate a forest that subsumes all possible compressions. This forest is then passed on to the forest ranking system which is used as decoder (Langkilde, 2000). The decoder gives us a list of possible compressions, for each possible length. Example compressions of Text (1) are shown in Figure 4 together with their respective log-probabilities.

² The discourse parser achieves an f-score of 38.2 for EDU identification, 50.0 for identifying hierarchical spans, 39.9 for nuclearity identification and 23.4 for relation tagging.

Figure 3: The pipeline of system components.

In order to choose the “best” compression at any possible length, we cannot rely only on the log-probabilities, lest the system always choose the shortest possible compression. In order to compensate for this, we normalize by length. However, in practice, simply dividing the log-probability by the length of the compression is insufficient for longer documents. Experimentally, we found a reasonable metric was to, for a compression of length $n$, divide each log-probability by [...]. This was the job of the length chooser from Figure 3, and enabled us to choose a single compression for each document, which was used for evaluation. (In Figure 4, the compression chosen by the length selector is italicized and was the shortest one³.)

5 Results

For testing, we began with two sets of data. The first set is drawn from the Wall Street Journal (WSJ) portion of the Penn Treebank and consists of 16 documents, each containing between 41 and 87 words. The second set is drawn from a collection of student compositions and consists of 5 documents, each containing between 64 and 91 words. We call this set the MITRE corpus (Hirschman et al., 1999). We would have liked to run evaluations on longer documents. Unfortunately, the forests generated even for relatively small documents are huge. Because there are an exponential number of summaries that can be generated for any given text⁴, the decoder runs out of memory for longer documents; therefore, we selected shorter subtexts from the original documents.

³ This tends to be the case for very short documents, as the compressions never get sufficiently long for the length normalization to have an effect.

We used both the WSJ and Mitre data for evaluation because we wanted to see whether the performance of our system varies with text genre. The Mitre data consists mostly of short sentences (average document length from Mitre is 6 sentences), quite in contrast to the typically long sentences in the Wall Street Journal articles (average document length from WSJ is 3.25 sentences).

For purpose of comparison, the Mitre data was compressed using five systems:

Random: Drops random words (each word has a 50% chance of being dropped); this is the baseline.

Hand: Hand compressions done by a human.

Concat: Each sentence is compressed individually; the results are concatenated together, using Knight & Marcu’s (2000) system here for comparison.

EDU: The system described in this paper.

Sent: Because syntactic parsers tend not to work well parsing just clauses, this system merges together leaves in the discourse tree which are in the same sentence, and then proceeds as described in this paper.
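The Random baseline in the list above is simple enough to sketch directly, together with the compression rate reported with the results (compressed length over original length). The function names, the whitespace tokenization, and the fixed seed are assumptions for illustration.

```python
import random
from typing import List, Optional


def random_compress(words: List[str], drop_prob: float = 0.5,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Drop each word independently with probability `drop_prob`."""
    rng = rng or random.Random()
    return [w for w in words if rng.random() >= drop_prob]


def compression_rate(compressed: List[str], original: List[str]) -> float:
    """Length of the compressed text divided by the length of the original."""
    return len(compressed) / len(original)


doc = ("The mayor is now looking for re-election . John Doe has already secured "
       "the vote of most democrats in his constituency .").split()
out = random_compress(doc, rng=random.Random(0))  # fixed seed, chosen arbitrarily
print(" ".join(out))
print(round(compression_rate(out, doc), 2))
```

With a 50% drop probability the expected compression rate is 0.5, though any single run on a short document can deviate from it noticeably.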

The Wall Street Journal data was evaluated on the above five systems as well as two additions. Since the correct discourse trees were known for these data, we thought it wise to test the systems using these human-built discourse trees, instead of the automatically derived ones. The additional two systems were:

PD-EDU: Same as EDU except using the perfect discourse trees, available from the RST corpus (Carlson et al., 2001).

PD-Sent: The same as Sent except using the perfect discourse trees.

⁴ In theory, a text of $n$ words has $2^n$ possible compressions.

len | log prob  | best compression
8   | -118.9060 | Mayor is now looking which is enough.
13  | -137.1010 | The mayor is now looking which is already almost enough to win.
16  | -147.5970 | The mayor is now looking but without support, he is still on shaky ground.
18  | -160.4310 | Mayor is now looking but without the support of governor, he is still on shaky ground.
22  | -176.1990 | The mayor is now looking for re-election but without the support of the governor, he is still on shaky ground.
28  | -239.9490 | The mayor is now looking which is already almost enough to win. But without the support of the governor, he is still on shaky ground.

Figure 4: Possible compressions for Text (1).

Six human evaluators rated the systems according to three metrics. The first two, presented together to the evaluators, were grammaticality and coherence; the third, presented separately, was summary quality. Grammaticality was a judgment of how good the English of the compressions was; coherence included how well the compression flowed (for instance, anaphors lacking an antecedent would lower coherence). Summary quality, on the other hand, was a judgment of how well the compression retained the meaning of the original document. Each measure was rated on a scale from 1 (worst) to 5 (best).

We can draw several conclusions from the evaluation results shown in Table 2 along with average compression rate (Cmp, the length of the compressed document divided by the original length).⁵ First, it is clear that genre influences the results. Because the Mitre data contained mostly short sentences, the syntax and discourse parsers made fewer errors, which allowed for better compressions to be generated. For the Mitre corpus, compressions obtained starting from discourse trees built above the sentence level were better than compressions obtained starting from discourse trees built above the EDU level. For the WSJ corpus, compressions obtained starting from discourse trees built above the sentence level were more grammatical, but less coherent than compressions obtained starting from discourse trees built above the EDU level. Choosing the manner in which the discourse and syntactic representations of texts are mixed should be influenced by the genre of the texts one is interested in compressing.

⁵ We did not run the system on the MITRE data with perfect discourse trees because we did not have hand-built discourse trees for this corpus.


         |           WSJ            |          Mitre
System   | Cmp   Grm   Coh   Qual   | Cmp   Grm   Coh   Qual
Random   | 0.51  1.60  1.58  2.13   | 0.47  1.43  1.77  1.80
Concat   | 0.44  3.30  2.98  2.70   | 0.42  2.87  2.50  2.08
EDU      | 0.49  3.36  3.33  3.03   | 0.47  3.40  3.30  2.60
Sent     | 0.47  3.45  3.16  2.88   | 0.44  4.27  3.63  3.36
PD-EDU   | 0.47  3.61  3.23  2.95   | -     -     -     -
PD-Sent  | 0.48  3.96  3.65  2.84   | -     -     -     -
Hand     | 0.59  4.65  4.48  4.53   | 0.46  4.97  4.80  4.52

Table 2: Evaluation Results


The compressions obtained starting from perfectly derived discourse trees indicate that perfect discourse structures help greatly in improving coherence and grammaticality of generated summaries. It was surprising to see that the summary quality was affected negatively by the use of perfect discourse structures (although the difference was not statistically significant). We believe this happened because the text fragments we summarized were extracted from longer documents. It is likely that had the discourse structures been built specifically for these short text snippets, they would have been different. Moreover, there was no component designed to handle cohesion; thus it is to be expected that many compressions would contain dangling references.

Overall, all our systems outperformed both the Random baseline and the Concat systems, which empirically shows that discourse has an important role in document summarization. We performed t-tests on the results and found that on the Wall Street Journal data, the differences in score between the Concat and Sent systems for grammaticality and coherence were statistically significant at the 95% level, but the difference in score for summary quality was not. For the Mitre data, the differences in score between the Concat and Sent systems for grammaticality and summary quality were statistically significant at the 95% level, but the difference in score for coherence was not. The score differences for grammaticality, coherence, and summary quality between our systems and the baselines were statistically significant at the 95% level.

The results in Table 2, which can be also assessed by inspecting the compressions in Figure 4, show that, in spite of our success, we are still far away from human performance levels. An error that our system makes often is that of dropping complements that cannot be dropped, such as the phrase “for re-election”, which is the complement of “is looking”. We are currently experimenting with lexicalized models of syntax that would prevent our compression system from dropping required verb arguments. We are also considering methods for scaling up the decoder to handle documents of more realistic length.